13 research outputs found
Semantic data integration and knowledge graph creation at scale
In contrast to data, knowledge is often abstract. Concrete knowledge can be achieved by including semantics in the data models, which highlights the role of data integration. The massive growth of data in recent years has increased the demand for scaling up data management techniques; materialized data integration, a.k.a. knowledge graph creation, falls into that category.
In this thesis, we investigate efficient methods and techniques for materializing data integration. We formalize the process of materializing data integration and formally define the characteristics of a materialized data integration system that merges the data operators and sources. Owing to this formalism, both layers of data integration, i.e., data-level and schema-level integration, are formalized in the context of mapping assertions. We explore optimization opportunities for improving the materialization of data integration systems, recognizing three angles, including intra- and inter-mapping assertions, from which the materialization can be improved. Accordingly, we propose source-based, mapping-based, and inter-mapping-assertion groups of optimization techniques. We apply our proposed techniques in three real-world projects and illustrate how these optimization techniques contribute to meeting the objectives of those projects.
Furthermore, we study the parameters that impact the performance of materializing data integration. Based on parameters reported in the literature and parameters presumed to have an impact, we build four groups of testbeds. We empirically study the performance of these testbeds, in terms of execution time, in the presence and absence of our proposed techniques, and observe that the savings can be up to 75%.
Lastly, we contribute to facilitating the process of defining declarative data integration systems. We propose two sets of data operation function signatures in the Function Ontology (FnO). The first set of functions is designed to perform entity alignment by resorting to an entity and relation linking tool. The second is a library of domain-specific functions that align genomic entities by harmonizing their representations. Finally, we introduce a tool equipped with a user interface that facilitates the definition of declarative mapping rules by allowing users to explore the data sources and the unified schema while defining their correspondences.
MapSDI: A Scaled-up Semantic Data Integration Framework for Knowledge Graph Creation
Semantic web technologies have contributed significantly with effective
solutions to the problems of data integration and knowledge graph creation.
However, with the rapid growth of big data in diverse domains, various
interoperability issues still need to be addressed, with scalability being one of
the main challenges. In this paper, we address the problem of knowledge graph
creation at scale and provide MapSDI, a mapping rule-based framework for
optimizing semantic data integration into knowledge graphs. MapSDI allows for
the semantic enrichment of large-sized, heterogeneous, and potentially
low-quality data efficiently. The input of MapSDI is a set of data sources and
mapping rules expressed in a mapping language such as RML. First, MapSDI
pre-processes the sources based on semantic information extracted from mapping
rules, by performing basic database operators; it projects out required
attributes, eliminates duplicates, and selects relevant entries. All these
operators are defined based on the knowledge encoded by the mapping rules,
which will then be used by the semantification engine (or RDFizer) to produce a
knowledge graph. We have empirically studied the impact of MapSDI on existing
RDFizers, and observed that knowledge graph creation time can be reduced on
average by one order of magnitude. It is also shown, theoretically, that the
source and rule transformations provided by MapSDI are data-lossless.
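The pre-processing idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (not MapSDI's actual API): the attributes referenced by the mapping rules drive a projection, a selection predicate filters irrelevant entries, and duplicate elimination removes rows that become identical after projection, before the reduced source reaches an RDFizer.

```python
# Hypothetical sketch of mapping-driven source pre-processing:
# projection, selection, and duplicate elimination as basic operators.

def preprocess(rows, referenced_attrs, predicate=lambda r: True):
    """Keep only relevant rows, project referenced attributes, drop duplicates."""
    seen = set()
    out = []
    for row in rows:
        if not predicate(row):                                  # selection
            continue
        projected = tuple(sorted((a, row[a]) for a in referenced_attrs))  # projection
        if projected in seen:                                   # duplicate elimination
            continue
        seen.add(projected)
        out.append(dict(projected))
    return out

source = [
    {"id": "1", "name": "aspirin", "batch": "A"},
    {"id": "1", "name": "aspirin", "batch": "B"},  # duplicate once 'batch' is projected out
    {"id": "2", "name": "", "batch": "C"},         # filtered out: empty name
]
reduced = preprocess(source, ["id", "name"], predicate=lambda r: r["name"])
# 'reduced' keeps one row per distinct (id, name) with a non-empty name
```

The RDFizer then sees a smaller, duplicate-free input, which is where the reported speed-up comes from.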
Dragoman: Efficiently Evaluating Declarative Mapping Languages over Frameworks for Knowledge Graph Creation
In recent years, there have been valuable efforts and contributions to make
the process of RDF knowledge graph creation traceable and transparent;
extending and applying declarative mapping languages is an example. One
challenging step is the traceability of procedures that aim to overcome
interoperability issues, a.k.a. data-level integration. In most pipelines, data
integration is performed by ad-hoc programs, preventing traceability and
reusability. However, formal frameworks provided by function-based declarative
mapping languages such as FunUL and RML+FnO empower expressiveness. Data-level
integration can be defined as functions and integrated as part of the mappings
performing schema-level integration. However, combining functions with the
mappings introduces a new source of complexity that can considerably impact the
required number of resources and execution time. We tackle the problem of
efficiently executing mappings with functions and formalize their
transformation into function-free mappings. These transformations are the basis of an
optimization process that aims to perform an eager evaluation of function-based
mapping rules. These techniques are implemented in a framework named Dragoman.
We demonstrate the correctness of the transformations while ensuring that the
function-free data integration processes are equivalent to the original ones.
The effectiveness of Dragoman is empirically evaluated in 230 testbeds composed
of various types of functions integrated with mapping rules of different
complexity. The outcomes suggest that evaluating function-free mapping rules
reduces execution time in complex knowledge graph creation pipelines composed
of large data sources and multiple types of mapping rules. The savings can be
up to 75%, suggesting that eagerly executing functions in mapping rules makes
these pipelines applicable and scalable in real-world settings.
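The eager-evaluation idea can be illustrated with a small sketch. This is a hypothetical simplification (not Dragoman's implementation): instead of invoking a transformation function once per generated triple, each distinct input value is computed once and the result is materialized as a new source attribute, so the mapping rule that consumes it becomes function-free.

```python
# Hypothetical sketch of eager function evaluation:
# materialize fn over distinct values of an attribute as a new column,
# so downstream mapping rules can reference it without calling functions.

def eager_evaluate(rows, input_attr, output_attr, fn):
    """Extend each row with fn(row[input_attr]), computing each distinct value once."""
    cache = {}
    for row in rows:
        v = row[input_attr]
        if v not in cache:          # each distinct value is evaluated exactly once
            cache[v] = fn(v)
        row[output_attr] = cache[v]
    return rows

rows = [{"gene": "tp53"}, {"gene": "TP53"}, {"gene": "tp53"}]
eager_evaluate(rows, "gene", "gene_norm", str.upper)
# every row now carries 'gene_norm'; a function-free rule can use it directly
```

With highly duplicated sources, the cache means an expensive function runs once per distinct value rather than once per row, which is where savings of the reported magnitude can arise.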
FunMap: Efficient Execution of Functional Mappings for Knowledge Graph Creation
Data has grown exponentially in recent years, and knowledge graphs constitute powerful formalisms to integrate a myriad of existing data sources. Transformation functions, specified with function-based mapping languages like FunUL and RML+FnO, can be applied to overcome interoperability issues across heterogeneous data sources. However, the absence of engines that efficiently execute these mapping languages hinders their global adoption. We propose FunMap, an interpreter of function-based mapping languages; it relies on a set of lossless rewriting rules to push down and materialize the execution of functions in the initial steps of knowledge graph creation. Although applicable to any function-based mapping language that supports joins between mapping rules, FunMap's feasibility is shown on RML+FnO. FunMap reduces data redundancy, e.g., duplicates and unused attributes, and converts RML+FnO mappings into a set of equivalent rules executable on RML-compliant engines. We evaluate FunMap's performance over real-world testbeds from the biomedical domain. The results indicate that FunMap reduces the execution time of RML-compliant engines by up to a factor of 18, furnishing, thus, a scalable solution for knowledge graph creation.
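The rewriting idea can be sketched abstractly. This is a hypothetical illustration, not FunMap's actual rule format or API: a mapping whose object is produced by a function is rewritten into an equivalent function-free mapping that references a precomputed attribute, with the function's results materialized into the transformed source.

```python
# Hypothetical sketch of rewriting a function-based mapping rule into a
# function-free one by materializing the function's output as a new column.

def rewrite_mapping(mapping, fn_registry, rows):
    """Replace a function-valued object map with a plain attribute reference."""
    obj = mapping["object"]
    if obj.get("function") is None:
        return mapping, rows                      # already function-free
    fn = fn_registry[obj["function"]]
    new_attr = obj["function"] + "_out"
    for row in rows:                              # push the function down to the source
        row[new_attr] = fn(*(row[p] for p in obj["params"]))
    rewritten = dict(mapping, object={"reference": new_attr, "function": None})
    return rewritten, rows

registry = {"concat_name": lambda first, last: first + " " + last}
mapping = {
    "subject": "http://example.org/person/{id}",  # illustrative template
    "predicate": "http://xmlns.com/foaf/0.1/name",
    "object": {"function": "concat_name", "params": ["first", "last"]},
}
rows = [{"id": "1", "first": "Ada", "last": "Lovelace"}]
new_mapping, new_rows = rewrite_mapping(mapping, registry, rows)
# new_mapping now references 'concat_name_out' instead of calling a function
```

Because the rewritten rule contains no function terms, any RML-compliant engine can execute it unchanged, which is the portability argument the abstract makes.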
SDM-RDFizer: An RML Interpreter for the Efficient Creation of RDF Knowledge Graphs
In recent years, the amount of data has increased exponentially, and knowledge graphs have gained attention as data structures to integrate data and knowledge harvested from myriad data sources. However, these data sources are usually characterized by data complexity issues like large volume, high duplicate rates, and heterogeneity, requiring data management tools able to address the negative impact of these issues on the knowledge graph creation process. In this paper, we propose the SDM-RDFizer, an interpreter of the RDF Mapping Language (RML), to transform raw data in various formats into an RDF knowledge graph. SDM-RDFizer implements novel algorithms to execute the logical operators between mappings in RML, thus allowing it to scale up to complex scenarios where data is not only large but also has a high duplication rate. We empirically evaluate the SDM-RDFizer's performance against diverse testbeds with diverse configurations of data volume, duplicates, and heterogeneity. The observed results indicate that SDM-RDFizer is two orders of magnitude faster than the state of the art, meaning that SDM-RDFizer is an interoperable and scalable solution for knowledge graph creation. SDM-RDFizer is publicly available as a resource through a GitHub repository and a DOI.
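The duplicate-handling concern can be illustrated with a minimal sketch. This is a hypothetical simplification (not SDM-RDFizer's actual data structures): tracking already-emitted triples ensures each distinct (subject, predicate, object) is produced exactly once, even when the source has a very high duplicate rate.

```python
# Hypothetical sketch of duplicate-aware triple generation:
# with highly duplicated sources, a seen-set avoids re-emitting triples.

def generate_triples(rows, subject_tpl, predicate, object_attr):
    """Yield each distinct (s, p, o) triple exactly once."""
    emitted = set()
    for row in rows:
        triple = (subject_tpl.format(**row), predicate, row[object_attr])
        if triple in emitted:       # duplicate row -> skip re-emission
            continue
        emitted.add(triple)
        yield triple

rows = [{"id": "d1", "label": "aspirin"}] * 1000   # highly duplicated source
triples = list(generate_triples(rows, "http://example.org/drug/{id}",
                                "rdfs:label", "label"))
# only one triple survives despite 1000 input rows
```

Real engines use more elaborate operator-level strategies, but the principle is the same: duplicate elimination during generation keeps both output size and downstream work proportional to the distinct data, not the raw row count.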
SDM-TIB/SDM-RDFizer: v4.7.2.1
An Efficient RML-Compliant Engine for Knowledge Graph Construction
The RML Ontology: A Community-Driven Modular Redesign After a Decade of Experience in Mapping Heterogeneous Data to RDF
The Relational to RDF Mapping Language (R2RML) became a W3C Recommendation a decade ago. Despite its wide adoption, its potential applicability beyond relational databases was swiftly explored. As a result, several extensions and new mapping languages were proposed to tackle the limitations that surfaced as R2RML was applied in real-world use cases. Over the years, one of these languages, the RDF Mapping Language (RML), has gathered a large community of contributors, users, and compliant tools. So far, there has been no well-defined set of features for the mapping language, nor a consensus-marking ontology. Consequently, it has become challenging for non-experts to fully comprehend and utilize the full range of the language's capabilities. After three years of work, the W3C Community Group on Knowledge Graph Construction proposes a new specification for RML. This paper presents the new modular RML ontology and the accompanying SHACL shapes that complement the specification. We discuss the motivations and challenges that emerged when extending R2RML, the methodology we followed to design the new ontology while ensuring its backward compatibility with R2RML, and the novel features that increase its expressiveness. The new ontology consolidates the potential of RML, empowers practitioners to define mapping rules for constructing RDF graphs that were previously unattainable, and allows developers to implement systems in adherence with [R2]RML.
Knowledge4COVID-19: A Semantic-based Approach for Constructing a COVID-19 related Knowledge Graph from Various Sources and Analysing Treatments' Toxicities
In this paper, we present Knowledge4COVID-19, a framework that aims to
showcase the power of integrating disparate sources of knowledge to discover
adverse drug effects caused by drug-drug interactions among COVID-19 treatments
and pre-existing condition drugs. Initially, we focus on constructing the
Knowledge4COVID-19 knowledge graph (KG) from the declarative definition of
mapping rules using the RDF Mapping Language. Since valuable information about
drug treatments, drug-drug interactions, and side effects is present in textual
descriptions in scientific databases (e.g., DrugBank) or in scientific
literature (e.g., the CORD-19, the Covid-19 Open Research Dataset), the
Knowledge4COVID-19 framework implements Natural Language Processing. The
Knowledge4COVID-19 framework extracts relevant entities and predicates that
enable the fine-grained description of COVID-19 treatments and the potential
adverse events that may occur when these treatments are combined with
treatments of common comorbidities, e.g., hypertension, diabetes, or asthma.
Moreover, on top of the KG, several techniques for the discovery and prediction
of interactions and potential adverse effects of drugs have been developed with
the aim of suggesting more accurate treatments for treating the virus. We
provide services to traverse the KG and visualize the effects that a group of
drugs may have on a treatment outcome. Knowledge4COVID-19 was part of the
Pan-European hackathon#EUvsVirus in April 2020 and is publicly available as a
resource through a GitHub repository
(https://github.com/SDM-TIB/Knowledge4COVID-19) and a DOI
(https://zenodo.org/record/4701817#.YH336-8zbol).